INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE 



Conflict-free Replicated Data Types 

Marc Shapiro, INRIA & LIP6, Paris, France 
Nuno Preguiça, CITI, Universidade Nova de Lisboa, Portugal
Carlos Baquero, Universidade do Minho, Portugal 

Marek Zawirski, INRIA & UPMC, Paris, France 



N° 7687 

Juillet 2011 

Theme COM 





ROCQUENCOURT 




Theme COM — Systemes communicants 
Projet Regal 

Rapport de recherche n° 7687 — Juillet 2011 — 18 pages 



Abstract: Replicating data under Eventual Consistency (EC) allows any replica to accept 
updates without remote synchronisation. This ensures performance and scalability in large- 
scale distributed systems (e.g., clouds). However, published EC approaches are ad-hoc and 
error-prone. Under a formal Strong Eventual Consistency (SEC) model, we study sufficient 
conditions for convergence. A data type that satisfies these conditions is called a Conflict- 
free Replicated Data Type (CRDT). Replicas of any CRDT are guaranteed to converge in 
a self-stabilising manner, despite any number of failures. This paper formalises two popular 
approaches (state- and operation-based) and their relevant sufficient conditions. We study a 
number of useful CRDTs, such as sets with clean semantics, supporting both add and remove 
operations, and consider in depth the more complex Graph data type. CRDT types can be 
composed to develop large-scale distributed applications, and have interesting theoretical 
properties.



* This research is supported in part by ANR project ConcoRDanT (ANR-10-BLAN 0208). Marek Zawirski
is supported in part by his Google Europe Fellowship in Distributed Computing 2010. Carlos Baquero
is partially supported by FCT project Castor (PTDC/EIA-EIA/104022/2008).



Unite de recherche INRIA Rocquencourt 
Domaine de Voluceau, Rocquencourt, BP 105, 78153 Le Chesnay Cedex (France) 
Telephone : +33 1 39 63 55 11 — Telecopie : +33 1 39 63 53 30 






1 Introduction 

Replication and consistency are essential features of any large distributed system, such as 
the WWW, peer-to-peer, or cloud computing platforms. The standard "strong consistency" 
approach serialises updates in a global total order [10]. This constitutes a performance 
and scalability bottleneck. Furthermore, strong consistency conflicts with availability and
partition-tolerance [8].

When network delays are large or partitioning is an issue, as in delay-tolerant networks, 
disconnected operation, cloud computing, or P2P systems, eventual consistency promises 
better availability and performance [17, 20]. An update executes at some replica, without 
synchronisation; later, it is sent to the other replicas. All updates eventually take effect 
at all replicas, asynchronously and possibly in different orders. Concurrent updates may 
conflict; conflict arbitration may require a consensus and a roll-back. 1 

This weaker consistency is considered acceptable for some classes of applications. How- 
ever, conflict resolution is hard. The literature offers little guidance on designing a correct 
optimistic system. Ad-hoc approaches are brittle and error-prone; witness for instance the 
concurrency anomalies of the Amazon Shopping Cart [3].

We propose a simple, theoretically-sound approach to eventual consistency. Our system 
model, Strong Eventual Consistency or SEC, avoids the complexity of conflict resolution and 
of roll-back. Conflict-freedom ensures safety and liveness despite any number of failures. It 
leverages simple mathematical properties that ensure absence of conflict, i.e., monotonicity 
in a semi-lattice and/or commutativity. A trivial example is a replicated counter, which (as- 
suming no overflow) converges because its increment and decrement operations commute. 
In our conflict-free replicated data types (CRDTs), an update does not require synchroni- 
sation, and CRDT replicas provably converge to a correct common state. CRDTs remain 
responsive, available and scalable despite high network latency, faults, or disconnection. 

Non-trivial CRDTs are known to exist: for instance, we previously published Treedoc, a 
sequence CRDT for co-operative text editing [14]. Our aim here is to expand our knowledge 
of the principles and practice of CRDTs. We claim the following contributions for this paper: 

• A solution to the CAP problem, Strong Eventual Consistency (SEC). 

• Formal definitions of Strong Eventual Consistency (SEC) and of CRDTs. 

• Two sufficient conditions for SEC. 

• A strong equivalence between the two conditions. 

• We show that SEC is incomparable to sequential consistency. 

• Description of basic CRDTs, including integer vectors and counters. 

• More advanced CRDTs, including sets and graphs. 

1 A conflict is a combination of concurrent updates, which may be individually correct, but that, taken
together, would violate some invariant.

Figure 1: State-based replication



Figure 2: Operation-based replication 



We refer the interested reader to a separate technical report [18] for a comprehensive 
portfolio of CRDT designs. 

2 System model 

We consider a system of processes interconnected by an asynchronous network. The network 
can partition and recover. We assume a finite set $\Pi = \{p_0, \ldots, p_{n-1}\}$ of non-byzantine
processes. Processes in $\Pi$ may crash silently; a crashed process may remain crashed forever,
or may recover with its memory intact. A non-crashed process is said to be correct.

2.1 State-based object 

In this section we specify replicated objects in the so-called state-based style. The intuition 
is illustrated in Figure 1. Executing an update modifies the state of a single replica. Every 
replica occasionally sends its local state to some other replica, which merges the state thus 
received into its own state. In this way, every update eventually reaches every replica, either 
directly or indirectly. 

With no loss of generality, we consider a single object with one replica at each process.
An object is a tuple $(S, s^0, q, u, m)$. The replica at process $p_i$ has state $s_i \in S$, called its
payload; the initial state is $s^0$. A client of the object may read the state of the object via
query method $q$ and modify it via update method $u$. Method $m$ serves to merge the state
from a remote replica. A method (whether $q$, $u$ or $m$) executes at a single replica.

Systems that deliver every update to every replica eventually in a fault-tolerant manner 
are well-known in the literature, for instance gossip or anti-entropy approaches [5, 13]. For 
simplicity, we will assume hereafter a fully connected communication graph, where every arc 
is a fair-lossy channel. Infinitely often, the replica at $p_i$ sends (if it is correct) its current state
to $p_j$; replica $p_j$ (if it is correct) merges the received state into its local state by executing
method $m$.

A method whose precondition is satisfied is said to be enabled. We assume that an enabled
method executes as soon as it is invoked. Method executions at some replica are numbered
sequentially from 1. The $k$-th method execution at replica $i$ will be noted $f_i^k(a)$, where $f$
is either $q$, $u$ or $m$, and $a$ denotes the arguments. We note $K_i(f)$ the ordinal of execution
$f$ at replica $i$, i.e., $K_i(f_j^k(a)) = k$ for $i = j$, and is undefined otherwise. (Abusing notation
somewhat, we may drop subscripts, superscripts and/or arguments when there is no
ambiguity.)

The states of a replica are numbered sequentially, incrementing with each method execution.
Thus, replica $i$ has initial state $s_i^0 = s^0$. Before its $k$-th method execution it has
state $s_i^{k-1}$, and $s_i^k$ afterwards. We note the transition $s_i^{k-1} \bullet f_i^k(a) = s_i^k$.

We define state equivalence $s \equiv s'$ if all queries return the same result for $s$ and $s'$. A
query has no side-effects, i.e., $(s \bullet q) \equiv s$.

Definition 2.1 (Causal History (state-based)). We define the object's causal history $C =
[c_1, \ldots, c_n]$ (where $c_i$ goes through a sequence of states $c_i^0, \ldots, c_i^k, \ldots$) as follows. Initially,
$c_i^0 = \emptyset$, for all $i$. If the $k$-th method execution at $i$ is: (i) a query $q$: the causal history does
not change, i.e., $c_i^k = c_i^{k-1}$; (ii) an update (noted $u_i^k(a)$): it is added to the causal history,
i.e., $c_i^k = c_i^{k-1} \cup \{u_i^k(a)\}$; (iii) a merge $m_i^k(s_{i'}^{k'})$: the local and remote histories are
unioned together: $c_i^k = c_i^{k-1} \cup c_{i'}^{k'}$.

We say that an update is delivered at some replica when it is included in the causal history at
that replica. An update $u$ happened-before $u'$ iff $u$ is delivered when $u'$ executes:
$u \to u' \stackrel{\text{def}}{=} u \in c_j^{k-1}$, where $u'$ executes at replica $p_j$ and $K_j(u') = k$. Updates are
concurrent if neither happened-before the other: $u \parallel u' \stackrel{\text{def}}{=} u \not\to u' \land u' \not\to u$. Note that
the causal history is a formal reasoning device, which is normally not needed in a concrete
implementation.

Given our communication assumptions, we can conclude that, in a state-based object, 
every update is eventually delivered to all replicas. However, this is not sufficient to ensure 
that replicas converge. For instance, if the merge method m is a no-op, an update executed 
at some replica has no effect on other replicas, and they will never converge. 

2.2 Strong Eventual Consistency 

Informally, eventual consistency means that replicas eventually reach the same final value if 
clients stop submitting updates. We capture this intuition as follows: 

Definition 2.2 (Eventual Consistency (EC)). Eventual delivery: An update delivered at
some correct replica is eventually delivered to all correct replicas: $\forall i, j : f \in c_i \Rightarrow \Diamond f \in c_j$.

Convergence: Correct replicas that have delivered the same updates eventually reach equivalent
state: $\forall i, j : \Box c_i = c_j \Rightarrow \Diamond \Box s_i \equiv s_j$.
Termination: All method executions terminate.

Several EC systems will execute an update immediately, only to discover later that it 
conflicts with another, and to roll back to resolve this conflict [19]. This constitutes a waste 
of resources, and in general requires a consensus to ensure that all replicas arbitrate conflicts 
in the same way. To avoid this, we require a stronger condition: 

Definition 2.3 (Strong eventual consistency (SEC)). An object is Strongly Eventually Consistent
if it is Eventually Consistent and:
Strong Convergence: Correct replicas that have delivered the same updates have equivalent
state: $\forall i, j : c_i = c_j \Rightarrow s_i \equiv s_j$.

2.3 State-based Convergent Replicated Data Type (CvRDT) 

We now propose a sufficient condition for strong convergence in state-based objects. A join
semilattice (or just semilattice hereafter) is a partial order $\leq$ equipped with a least upper
bound (LUB) $\sqcup$ for all pairs: $m = x \sqcup y$ is a Least Upper Bound of $\{x, y\}$ under $\leq$ iff
$\forall m', x \leq m' \land y \leq m' \Rightarrow x \leq m \land y \leq m \land m \leq m'$. It follows that $\sqcup$ is: commutative:
$x \sqcup y = y \sqcup x$; idempotent: $x \sqcup x = x$; and associative: $(x \sqcup y) \sqcup z = x \sqcup (y \sqcup z)$.
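As an illustration of the LUB definition above, the following minimal sketch checks the four properties exhaustively on a small example semilattice: the subsets of {1, 2, 3} ordered by inclusion, with union as the candidate LUB. The universe and the names `join` and `is_lub` are assumptions chosen for the example; they do not come from the report.

```python
from itertools import combinations, product

# Hypothetical finite universe: all subsets of {1, 2, 3}, ordered by inclusion.
BASE = (1, 2, 3)
UNIVERSE = [frozenset(c) for r in range(len(BASE) + 1) for c in combinations(BASE, r)]


def join(x, y):
    """Candidate LUB for sets ordered by inclusion: the union."""
    return x | y


def is_lub(m, x, y):
    """m is an upper bound of {x, y} and lies below every other upper bound m2."""
    is_upper = x <= m and y <= m
    is_least = all(m <= m2 for m2 in UNIVERSE if x <= m2 and y <= m2)
    return is_upper and is_least


lub_ok = all(is_lub(join(x, y), x, y) for x, y in product(UNIVERSE, repeat=2))
commutative = all(join(x, y) == join(y, x) for x, y in product(UNIVERSE, repeat=2))
idempotent = all(join(x, x) == x for x in UNIVERSE)
associative = all(
    join(join(x, y), z) == join(x, join(y, z))
    for x, y, z in product(UNIVERSE, repeat=3)
)
print(lub_ok, commutative, idempotent, associative)
```

An exhaustive check on a finite universe is only a sanity test, not a proof; it is meant to make the quantifiers in the definition concrete.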

Definition 2.4 (Monotonic semilattice object). A state-based object, equipped with partial
order $\leq$, noted $(S, \leq, s^0, q, u, m)$, that has the following properties, is called a monotonic
semilattice: (i) Set $S$ of payload values forms a semilattice ordered by $\leq$. (ii) Merging state
$s$ with remote state $s'$ computes the LUB of the two states, i.e., $s \bullet m(s') = s \sqcup s'$. (iii) State
is monotonically non-decreasing across updates, i.e., $s \leq s \bullet u$.

Theorem 2.1 (Convergent Replicated Data Type (CvRDT)). Assuming eventual delivery 
and termination, any state-based object that satisfies the monotonic semilattice property is 
SEC. 

Proof. As we assumed earlier a fully connected communication graph and that replicas
transmit and merge state infinitely often, the conditions for eventual delivery are fulfilled.
With no loss of generality, we assume that every operation is enabled (otherwise its invoca- 
tion reduces to a no-op); furthermore we already assumed that an operation executes at a 
single replica. Under these conditions, an operation terminates if it has no infinite loops or 
recursion, which we assume to be true. 

We now focus on proving strong convergence. We first note that Definition 2.4 precludes 
spontaneous state changes or roll-backs: when a replica is in some state s, it can change 
state only by executing an update u or a merge m. 

Consider replicas $i$ and $j$. The proof assumption is $c_i = c_j$. Since updates are unique,
these replicas can only have delivered the same updates in the following conditions:

• They are in the initial state, and therefore $s_i \equiv s_j$.

• During the execution, there was a point $p, q$ when $c_i^p \subseteq c_j^q$. In $p_i$, for $k > p$ there is
a merge that included a state $s : s \equiv s_j^q$ and all $f_i^k$ operations are non-mutating or
merges with $s \leq s_j^q$. In $p_j$, for $k > q$, all operations $f_j^k$ are non-mutating or merges
with $s \leq s_j^q$. Therefore, since $\sqcup$ is a LUB, one has $s_i \equiv s_j (\equiv s_j^q)$.

• During the execution, there was a point $p, q$ when $c_j^q \subseteq c_i^p$. Proved by symmetry with
the previous case.

• During the execution, there was a point $p, q$ when $c_i^p \not\subseteq c_j^q$ and $c_j^q \not\subseteq c_i^p$. In $p_i$, for $k > p$
there is a merge that included a state $s : s_j^q \leq s \leq s_i^p \sqcup s_j^q$ and all $f_i^k$ operations are
non-mutating or merges with $s \leq s_i^p \sqcup s_j^q$. Conversely in $p_j$, for $k > q$ there is a merge
that included a state $s : s_i^p \leq s \leq s_i^p \sqcup s_j^q$ and all $f_j^k$ operations are non-mutating or
merges with $s \leq s_i^p \sqcup s_j^q$. In these conditions, $c_i = c_j = c_i^p \cup c_j^q$ and due to the LUB
properties of $\sqcup$ we have $s_i \equiv s_j (\equiv s_i^p \sqcup s_j^q \equiv s_j^q \sqcup s_i^p)$.

Since replicas transmit and merge state infinitely often, these conditions will occur
infinitely often. Finally, by transitivity, all replicas that deliver the same updates will reach
equivalent states.

□ 

A CvRDT converges towards the LUB of the most recent updates. We require that
$x \leq y \land y \leq x \Rightarrow x \equiv y$.

2.4 Op-based Commutative Replicated Data Type (CmRDT) 

Alternatively to the state-based style, a replicated object may be specified in the operation-based
(or op-based) style. An op-based object is a tuple $(S, s^0, q, t, u, P)$, where $S$, $s^0$ and $q$
have the same meaning as above (respectively state domain, initial state and query method).
An op-based object has no merge method; instead an update is split into a pair $(t, u)$,
where $t$ is a side-effect-free prepare-update method and $u$ is an effect-update method (their
arguments may differ, e.g., $t(a)$ and $u(a')$ in Figure 2). The prepare-update executes at the
single replica where the operation is invoked (its source). At the source, prepare-update
method $t$ is followed immediately by effect-update method $u$, i.e., $f_i^{k-1} = t \Rightarrow f_i^k = u$. (If
this were not true, there would be no causality between successive updates.)

The effect-update method executes at all replicas (said downstream). The source replica
delivers the effect-update to downstream replicas using a communication protocol specified 
by the delivery relation P, explained below. 

We use the same notations for states and causal history as above, except that now $f$ can
refer to any of $q$, $t$ or $u$. Both queries and prepare-update methods are side-effect-free, i.e.,
$s \bullet q \equiv s \bullet t \equiv s$.

Definition 2.5 (Causal History (op-based)). An object's causal history $C = \{c_1, \ldots, c_n\}$
is defined as follows. Initially, $c_i^0 = \emptyset$, for all $i$. If the $k$-th method execution at $i$ is: (i) a
query $q$ or a prepare-update $t$, the causal history does not change, i.e., $c_i^k = c_i^{k-1}$; (ii) an
effect-update $u_i^k(a)$, then $c_i^k = c_i^{k-1} \cup \{u_i^k(a)\}$.

An update is said to be delivered at a replica when the update is included in the replica's
causal history. Update $(t, u)$ happened-before $(t', u')$ iff the former is delivered when the
latter executes: $(t, u) \to (t', u') \stackrel{\text{def}}{\Leftrightarrow} u \in c_j^{k-1}$, where $t'$ executes at $p_j$ and $k = K_j(t')$. The
definition of concurrent updates remains as above.

We assume an underlying reliable causally-ordered broadcast communication protocol,
i.e., one that delivers every message to every recipient exactly once and in an order consistent
with happened-before. Such protocols are a standard feature of distributed systems; they
do not require consensus and they deliver to all correct processes as long as any network
partition eventually recovers (as we assumed earlier). It follows that two updates that are
related by happened-before execute at all replicas in the same sequential order:
$(t, u) \to (t', u') \Rightarrow \forall i, K_i(u) < K_i(u')$. However, concurrent updates may be delivered in any order.

Definition 2.6 (Commutativity). Updates $(t, u)$ and $(t', u')$ commute iff for any reachable
replica state $s$ where both $u$ and $u'$ are enabled, $u$ (resp. $u'$) remains enabled in state $s \bullet u'$
(resp. $s \bullet u$), and $s \bullet u \bullet u' \equiv s \bullet u' \bullet u$.

Clearly a sufficient condition for convergence of an op-based object is that all its con- 
current operations commute. An object satisfying this condition is called a Commutative 
Replicated Data Type (CmRDT). 
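To make the commutativity condition concrete, here is a minimal sketch of an op-based counter, the trivial example mentioned in the introduction. The class name `OpCounter` and the helper `final_value` are hypothetical; the sketch replays the same set of effect-updates in every possible delivery order and collects the resulting states.

```python
from itertools import permutations


class OpCounter:
    """Op-based counter sketch: increments and decrements commute."""

    def __init__(self):
        self.state = 0

    def query(self):
        return self.state

    def prepare(self, delta):
        # Side-effect-free: only builds the message to broadcast.
        return delta

    def effect(self, delta):
        # Executes at every replica, the source included.
        self.state += delta


def final_value(ops):
    """State of a fresh replica after delivering ops in the given order."""
    replica = OpCounter()
    for op in ops:
        replica.effect(op)
    return replica.query()


source = OpCounter()
ops = [source.prepare(d) for d in (+1, +1, -1, +5)]
# Every delivery order of the same updates leads to the same state.
outcomes = {final_value(order) for order in permutations(ops)}
print(outcomes)
```

Because integer addition commutes, the set of outcomes contains a single value, which is what Definition 2.6 asks of concurrent updates (assuming no overflow, as the report notes).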

$P$ is a delivery precondition, i.e., effect-update method $u$ is enabled only if the precondition
is satisfied. We interpret this temporally, i.e., delivery of $u$ at replica $i$ may be delayed
until $P(s_i, u)$ is true. Therefore, for liveness, we now have the added obligation to prove that
delivery is eventually enabled. We therefore restrict our scope to preconditions for which
causally-ordered broadcast is sufficient to ensure $P$.

Theorem 2.2 (Commutative Replicated Data Type (CmRDT)). Assuming causal delivery 
of updates and method termination, any op-based object that satisfies the commutativity 
property for all concurrent updates, and whose delivery precondition is satisfied by causal 
delivery, is SEC. 

Proof. We assume delivery of updates to all correct replicas by reliable causal broadcast,
which fulfils their delivery specification $P$. Once delivered, operations cannot be undelivered.
We also assume that all CmRDT methods are well formed and terminate. Thus, we now
focus on proving strong convergence.

Consider any two correct replicas $p_i$ and $p_j$. Under the assumptions, eventually the
two replicas will deliver the same operations (if no new operations are generated), and we
have $c_i = c_j$. For any two updates $f(a), f'(a')$ in $c_i$: (i) If $f(a) \to f'(a')$, then by the causal
delivery assumption, the apply order is consistent with causality: $K_i(f(a)) < K_i(f'(a'))$;
(ii) If they are not causally related, $f(a) \parallel f'(a')$, then they must commute and can be
delivered in any relative order. In any replica $p_i$, both apply orders, $K_i(f(a)) < K_i(f'(a'))$
and $K_i(f(a)) > K_i(f'(a'))$, lead to the same effect. In all cases an equivalent abstract state
$s_i \equiv s_j$ is reached in the two replicas.

By transitivity, $\forall i, j : c_i = c_j \Rightarrow s_i \equiv s_j$.

□ 

3 Some results 

3.1 Fault-tolerance and the CAP theorem 

The CAP theorem states that it is impossible to simultaneously ensure strong consistency 
(C), availability (A) and tolerance to network partition (P) [8]. As network faults unavoidably
occur in a large-scale environment, a real system must sacrifice either consistency or availability.
Availability is often the top priority in practice [3]: does this mean giving up all
consistency guarantees?

No: SEC provides a solution. A SEC replica is always available for both reads and writes, 
independently of network conditions. Any communicating subset of replicas of a SEC object 
eventually converges, even if partitioned from the rest of the network. SEC is weaker than 
strong consistency but nonetheless provides the well-defined guarantee of strong eventual 
convergence. 

SEC provides an extreme form of fault tolerance, as a SEC object tolerates up to n — 1 
simultaneous crashes. Remarkably, SEC does not require solving consensus.

3.2 CvRDTs and CmRDTs are equivalent 

3.2.1 Operation-based emulation of a state-based object 

Theorem 3.1 (CmRDT emulation). Any SEC state-based object can be emulated by a SEC 
op-based object of a corresponding interface. 

Proof. Given a CvRDT represented by tuple $(S, \leq, s^0, q, u, m)$, we emulate it by a CmRDT
object $(S, s^0, q, t, u', P)$, which we specify as follows.

State and query of the CvRDT can be directly stored and processed by the emulating CmRDT
using the same definitions. A prepare-update $t(a)$ has the same interface (accepts the same
domain of arguments and returns the same domain of values) as an update $u(a)$. It records
the result of applying update $u(a)$ on a copy of the current replica state $s$: $s' = s \bullet u(a)$; the return
value of $u(a)$ is passed to the client. Recorded state $s'$ is used as the argument of an actual
effect-update $u'(s')$, which is delivered to all replicas by the underlying protocol of the CmRDT.
Precondition $P$ is unrestricted and enables delivery at any time. Effect-update $u'(s')$ merges
the received state using the original CvRDT merge method: $s \bullet u'(s') \stackrel{\text{def}}{=} s \bullet m(s')$.

Since merge is commutative, updates $u'(s')$ commute; and since the communication
is reliable, we have a CmRDT with strong eventual consistency, which propagates all
updates of the emulated CvRDT. □
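The construction in this proof can be sketched in a few lines. The example below is an illustration under assumptions, not the report's code: a grow-only set (`GSet`, a hypothetical stand-in for the emulated CvRDT) is wrapped by `OpBasedEmulation`, whose prepare-update applies the original update on a copy and whose effect-update merges the recorded state.

```python
import copy


class GSet:
    """State-based grow-only set, standing in for the emulated CvRDT."""

    def __init__(self):
        self.state = frozenset()

    def query(self):
        return self.state

    def update(self, e):
        self.state = self.state | {e}

    def merge(self, remote_state):
        self.state = self.state | remote_state


class OpBasedEmulation:
    """Op-based wrapper: prepare records s' = s . u(a); effect merges s'."""

    def __init__(self, cvrdt):
        self.obj = cvrdt

    def query(self):
        return self.obj.query()

    def prepare(self, a):
        scratch = copy.deepcopy(self.obj)  # apply u(a) on a copy only
        scratch.update(a)
        return scratch.state               # the message carries the whole state s'

    def effect(self, s_prime):
        self.obj.merge(s_prime)            # s . u'(s') = s . m(s')


r1, r2 = OpBasedEmulation(GSet()), OpBasedEmulation(GSet())
m1 = r1.prepare("x")
r1.effect(m1)          # at the source, effect follows prepare immediately
m2 = r2.prepare("y")
r2.effect(m2)
r1.effect(m2)          # cross delivery, in different relative orders
r2.effect(m1)
print(r1.query() == r2.query())
```

Note that each message carries a full state, so this emulation inherits the bandwidth cost of state-based replication; it is useful as a proof device rather than as an optimisation.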

3.2.2 State-based emulation of an operation-based object 

State-based emulation of an operation-based object essentially formalises the mechanics of 
an epidemic reliable causal broadcast. 

Theorem 3.2 (CvRDT emulation). Any SEC op-based object can be emulated by a SEC 
state-based object of a corresponding interface. 

Proof. Given a CmRDT represented by tuple $(S, s^0, q, t, u, P)$, we emulate it by a CvRDT
object $((S \times U \times U), \leq, (s^0, \emptyset, \emptyset), q', u', m)$, which we specify as follows.

Without loss of generality, we assume that each invocation $u_i^k$ is unique across replicas
and set $U$ denotes all possible updates. The CvRDT's state is then defined as a triple
$(s_m, M, D)$, where $s_m$ is a state of the emulated CmRDT, and $M$ and $D$ are two add-only sets of,
respectively, known and delivered updates. A relation $\leq$ is defined as follows:
$(s_m, M, D) \leq (s'_m, M', D') \stackrel{\text{def}}{=} M \subseteq M' \land D \subseteq D'$.

A query q'(a) has the same interface as q(a); we define it as a trivial delegation to q(a) 
on the CmRDT, s m • q(a). An update u'(a) has the same interface as prepare-update t(a). 
It first delegates the invocation to prepare-update t(a) of the CmRDT that in turn triggers 
effect-update u(a), which becomes a locally known update. Finally, u'(a) uses a recursive 
function d to process updates: 

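$$d(s_m, M, D) \stackrel{\text{def}}{=} \begin{cases} d(s_m \bullet u(a), M, D \cup \{u(a)\}) & \text{if } \exists u(a) \in M \setminus D : P(s_m, u(a)) \\ (s_m, M, D) & \text{otherwise} \end{cases}$$

Hence, $u'(a)$ is defined as: $(s_m, M, D) \bullet u'(a) \stackrel{\text{def}}{=} d(s_m \bullet t(a), M \cup \{u(a)\}, D)$.

Finally, merge $m$ takes a union of known messages and processes available updates:
$(s_m, M, D) \bullet m(s'_m, M', D') \stackrel{\text{def}}{=} d(s_m, M \cup M', D)$.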

Since the emulation ensures that messages are delivered exactly once to each replica's 
embedded object, in the appropriate order, and since the CvRDT conforms to SEC criteria, 
the embedded CmRDT instance is also SEC. 

□ 

Note that the emulating object forms a monotonic semilattice over domain $S \times U \times U$.
Calling or delivering an operation adds it to the relevant message set, and therefore advances
the state in the partial order. The merge method $m$ is defined to take the union of the $M$
sets and (possibly) update $D$, and is thus a LUB operation. This construction is similar
to Wuu and Bernstein's log covered in Section 4.2.
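The emulation of Theorem 3.2 can be sketched as follows, assuming the embedded CmRDT is a simple counter whose delivery precondition always holds. All names (`Update`, `StateBasedEmulation`, the `(replica id, sequence number)` identifiers) are hypothetical, and the function `d` is written iteratively rather than recursively; since prepare-update is side-effect-free, $s_m \bullet t(a)$ is taken to be $s_m$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    uid: tuple   # (replica id, sequence number), so each update is unique
    delta: int   # argument of the embedded counter's effect-update


def P(s_m, u):
    return True  # assumption: a counter's effect-update is always enabled


def d(s_m, M, D):
    """Deliver every known, not yet delivered update whose precondition holds."""
    while True:
        ready = [u for u in M - D if P(s_m, u)]
        if not ready:
            return s_m, M, D
        u = min(ready, key=lambda x: x.uid)  # any enabled update would do
        s_m, D = s_m + u.delta, D | {u}


class StateBasedEmulation:
    def __init__(self, rid):
        self.rid, self.seq = rid, 0
        self.state = (0, frozenset(), frozenset())  # (s_m, M, D)

    def query(self):
        return self.state[0]

    def update(self, delta):
        self.seq += 1
        u = Update((self.rid, self.seq), delta)
        s_m, M, D = self.state
        self.state = d(s_m, M | {u}, D)

    def merge(self, remote_state):
        s_m, M, D = self.state
        _, M_remote, _ = remote_state
        self.state = d(s_m, M | M_remote, D)


a, b = StateBasedEmulation("a"), StateBasedEmulation("b")
a.update(+2)
a.update(-1)
b.update(+10)
a.merge(b.state)
b.merge(a.state)
print(a.query(), b.query())
```

With a non-trivial precondition (for example, one encoding causal order), `P` would reject some updates until their predecessors appear in `D`, and `d` would deliver them on a later call.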



3.3 SEC is incomparable to sequential consistency 

A state-based replica executes a sequence of query, update, and merge methods. In addi- 
tion to its sequential behaviour, a CRDT specifies concurrent behaviours that must satisfy 
the strong convergence property. As we show now, this permits executions that would be 
impossible in a sequentially-consistent system. 

Consider a Set CRDT $S$ with operations add(e) and remove(e). Immediately after
add(e), the state will satisfy $e \in S$; after remove(e) the state satisfies $e \notin S$. In a sequential
execution, the last update wins, e.g., after remove(e) $\to$ add(e) the state satisfies $e \in S$.
Concurrent adds or removes of different elements are independent, e.g., after add(e) $\parallel$
remove(e') the state satisfies $e \in S \land e' \notin S$.

There is a choice of alternative semantics for concurrent updates of the same element.
When concurrently adding and removing the same element, the add could win, or the remove
could win, or the update of the replica with the highest IP address could win, or the state
might be reset to a distinguished state $\bot$, and so on. All these alternatives satisfy the strong
convergence condition, and any of them may be reasonable for some application.

Let us consider the add-wins alternative: after add(e) $\parallel$ remove(e) the state satisfies $e \in S$.
Now consider the following scenario. Replica $p_0$ executes the sequence add(e); remove(e').
Concurrently, replica $p_1$ executes add(e'); remove(e). Then, replica $p_2$ merges the state
from $p_0$ and $p_1$. According to the concurrent specification, the final state at $p_2$ satisfies
$e \in S \land e' \in S$. Such a state would never occur in a sequentially-consistent execution, in
which either remove(e) or remove(e') must be last. Thus, there is a SEC object that is not
sequentially consistent.
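The scenario can be replayed with a small state-based sketch. The design below (unique tags on adds, removes that tombstone only the tags observed locally) is one possible way to obtain add-wins behaviour and is an assumption of this example; the report does not prescribe it at this point. The class name `AddWinsSet` and the shared tag counter are hypothetical.

```python
import itertools


class AddWinsSet:
    """State-based add-wins set sketch using unique tags and tombstones."""

    _tags = itertools.count()  # stand-in for globally unique tags

    def __init__(self):
        self.added = set()     # (element, tag) pairs
        self.removed = set()   # tombstones for the pairs observed by a remove

    def add(self, e):
        self.added.add((e, next(AddWinsSet._tags)))

    def remove(self, e):
        # Only tags visible at this replica are removed; a concurrent add
        # elsewhere carries a fresh tag and therefore survives the merge.
        self.removed |= {(x, t) for (x, t) in self.added if x == e}

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed

    def value(self):
        return {x for (x, _) in self.added - self.removed}


p0, p1, p2 = AddWinsSet(), AddWinsSet(), AddWinsSet()
p0.add("e")
p0.remove("e'")   # e' is unknown at p0: nothing to tombstone
p1.add("e'")
p1.remove("e")    # e is unknown at p1: nothing to tombstone
p2.merge(p0)
p2.merge(p1)
print(p2.value())
```

Both elements are present at $p_2$, a state that no sequential interleaving of the four operations could produce.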

Now consider the converse. In the absence of crashes, a sequentially-consistent object is 
SEC. Indeed, sequential consistency is defined by a single order of operations, after which 
all replicas must terminate with the same state. However, in the general case, sequential 
consistency requires consensus, which cannot be solved in the presence of n — 1 crashes. 
Therefore, SEC is incomparable with sequential consistency. 

4 Example CRDTs 

We now recall some basic CRDTs that are known in the existing literature, which we will 
later compose to build higher-level objects. We will use state- or op-based specifications as 
most convenient. Generally, we find the state-based style more compact and easier to reason 
about formally, whereas the op-based style is often convenient for implementation. 

4.1 Integer vectors and counters 

Consider the state-oriented specification of a vector-of-integers object:
$(\mathbb{N}^n, \leq^n, [0, \ldots, 0], \mathit{value}, \mathit{inc}, \max^n)$. Vectors $v, v' \in \mathbb{N}^n$ are (partially) ordered by
$v \leq^n v' \Leftrightarrow \forall j \in [0..n-1], v[j] \leq v'[j]$. A query invocation value() returns a copy of the local payload. An
update inc(i) increments the payload entry at index $i$, that is, $s \bullet \mathit{inc}(i) = [s'[0], \ldots, s'[n-1]]$
where $s'[j] = s[j] + 1$ if $i = j$ and $s'[j] = s[j]$ otherwise. Merging two vectors takes the
per-index maximum, i.e., $s \bullet \max^n(s') = [\max(s[0], s'[0]), \ldots, \max(s[n-1], s'[n-1])]$. We
omit the proof that it is a CRDT.

If each process pi is restricted to incrementing its own index inc(i), this is the well-known 
vector clock [11]. 
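A minimal sketch of the integer vector follows, with each replica incrementing only its own index as in the vector clock case. The class name `IntVector` and the method `leq` (the order $\leq^n$) are naming choices made for this example.

```python
class IntVector:
    """State-based vector of integers; merge is the per-index maximum."""

    def __init__(self, n):
        self.payload = [0] * n

    def value(self):
        return list(self.payload)  # a copy of the local payload

    def inc(self, i):
        self.payload[i] += 1

    def leq(self, other):
        # Partial order: every entry is less than or equal to its counterpart.
        return all(a <= b for a, b in zip(self.payload, other.payload))

    def merge(self, other):
        self.payload = [max(a, b) for a, b in zip(self.payload, other.payload)]


p0, p1 = IntVector(2), IntVector(2)
p0.inc(0)
p0.inc(0)
p1.inc(1)
concurrent = not p0.leq(p1) and not p1.leq(p0)  # neither state dominates
p0.merge(p1)
p1.merge(p0)
print(p0.value(), p1.value(), concurrent)
```

Merging is idempotent, so repeating `p0.merge(p1)` any number of times leaves the payload unchanged, which is what makes unreliable, duplicated state transmission harmless.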

An increment-only integer counter is very similar, the only difference being that query
invocation value() of a vector in state $v$ returns $|v| \stackrel{\text{def}}{=} \sum_j v[j]$. We construct an integer
counter that can be both incremented and decremented by associating two
increment-only counters $I$ and $D$, where incrementing increments $I$ and decrementing
increments $D$, whereas value() returns $|I| - |D|$. The ordering $\leq$ is defined as
$(I, D) \leq (I', D') \stackrel{\text{def}}{=} I \leq^n I' \land D \leq^n D'$.
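The counter built from two increment-only vectors can be sketched as below. The sketch assumes that each replica only touches its own index in $I$ and $D$ (the restriction mentioned for vector clocks); the class name `PNCounter` and the constructor arguments are hypothetical.

```python
class PNCounter:
    """Counter sketch: two increment-only vectors I and D, merged by max."""

    def __init__(self, n, replica_index):
        self.i = replica_index
        self.I = [0] * n   # increments, one entry per replica
        self.D = [0] * n   # decrements, one entry per replica

    def increment(self):
        self.I[self.i] += 1

    def decrement(self):
        self.D[self.i] += 1   # decrementing increments D

    def value(self):
        return sum(self.I) - sum(self.D)   # |I| - |D|

    def merge(self, other):
        self.I = [max(a, b) for a, b in zip(self.I, other.I)]
        self.D = [max(a, b) for a, b in zip(self.D, other.D)]


a, b = PNCounter(2, 0), PNCounter(2, 1)
a.increment()
a.increment()
b.decrement()
a.merge(b)
b.merge(a)
print(a.value(), b.value())
```

Keeping decrements in a separate growing vector is what preserves monotonicity: the payload only ever moves up in the order on $(I, D)$, even though the value it represents can go down.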

4.2 U-Set, map and log 

Another simple CRDT construct is an add-only set object $(S, \subseteq, \emptyset, \mathit{value}, \mathit{add}(e), \cup)$. The
payload is any set; sets are ordered by inclusion. A query value() returns a copy of the
local payload. Update add(e) adds element $e$ to the set, i.e., $s \bullet \mathit{add}(e) = s \cup \{e\}$. It is
well-known that sets ordered by $\subseteq$ form a semilattice with $\cup$ as the LUB operator. It is
clearly monotonic by the definition of add. Therefore, the add-only set is a CRDT.

Wuu and Bernstein build further CRDTs by combination of these basic components [22].
They propose a set with both add and remove operations by associating two add-only sets
$A$ and $R$; adding an element adds it to $A$, removing it adds it to $R$; query value() returns
the set difference $A \setminus R$. ($R$ is often called the tombstone set. A client is allowed to remove
only an element that is currently in $A$.) Note that they assume that every element is unique
and added only once; we call their construct U-Set [18]. Wuu and Bernstein derive their
Dictionary data type from U-Set in the obvious way.
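A minimal state-based sketch of the U-Set described above is given here. The class name `USet` and the choice to raise `ValueError` when the remove precondition fails are assumptions of the example; uniqueness of elements is left to the caller, as in the text.

```python
class USet:
    """U-Set sketch: two add-only sets A (added) and R (tombstones)."""

    def __init__(self):
        self.A = set()
        self.R = set()

    def add(self, e):
        # Assumption carried over from the text: e is unique and added once.
        self.A.add(e)

    def remove(self, e):
        if e not in self.value():
            raise ValueError("only an element currently in the set may be removed")
        self.R.add(e)

    def value(self):
        return self.A - self.R

    def merge(self, other):
        self.A |= other.A
        self.R |= other.R


a, b = USet(), USet()
a.add("x")
a.add("y")
b.merge(a)
b.remove("x")
a.merge(b)
print(a.value(), b.value())
```

Because `R` never shrinks, a removed element cannot come back, which is why the construct relies on each element being added only once.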

A Log is a replicated object, whose payload contains a set (initially empty) of (event, timestamp) 
pairs. It assumes that each process maintains a vector clock in the usual manner [11]. When 
an event e occurs at process i, the process invokes update add(e); the update method up- 
dates the vector clock (say, to state v) and adds the pair (e,v) to the set. The timestamp 
ensures that each entry is unique. The merge method takes the union of the local and a 
remote set. 
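The Log can be sketched along the same lines. In this illustrative version the vector clock is advanced on each local event and merged by per-index maximum when a remote log is merged, which is one common reading of "in the usual manner"; the class name `Log` and the sorted `value` query are assumptions.

```python
class Log:
    """Replicated log sketch: a set of (event, vector timestamp) pairs."""

    def __init__(self, n, replica_index):
        self.i = replica_index
        self.clock = [0] * n
        self.entries = set()

    def add(self, event):
        self.clock[self.i] += 1                       # tick the local entry
        self.entries.add((event, tuple(self.clock)))  # timestamp makes it unique

    def merge(self, other):
        self.entries |= other.entries                 # union of local and remote sets
        self.clock = [max(a, b) for a, b in zip(self.clock, other.clock)]

    def value(self):
        # Lexicographic order on timestamps extends the componentwise order.
        return sorted(self.entries, key=lambda pair: pair[1])


l0, l1 = Log(2, 0), Log(2, 1)
l0.add("a")
l1.add("b")
l0.merge(l1)
l0.add("c")      # timestamp (2, 1): causally after both "a" and "b"
l1.merge(l0)
print(l1.value())
```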

To avoid unbounded growth, Wuu and Bernstein propose a distributed garbage collection 
algorithm that discards unneeded entries. In order to tolerate n — 1 crashes, only an entry 
that has been delivered to all processes may be discarded. If vector clock entry $v_i[j] = k$, this
implies that process $p_i$ has delivered the first $k$ events of process $p_j$. Each replica maintains
in its payload a copy of all remote vector clocks; for each remote site, the merge procedure 
keeps the largest version. Then, a replica may discard a log entry as soon as its timestamp 
is less than all the remote vector clocks. This algorithm does not require a consensus, but 
it is live only if no process is crashed. However, this may be acceptable, since the liveness 
of garbage collection does not impact the correctness of the main algorithm. 

This algorithm may be adapted to other data types, for instance to discarding the A and 
R entries of a removed element in the U-Set. 

5 Directed Graph CRDT 

Now let us examine how one would design a more complex data type: a Directed Graph
CRDT. Graphs are an important general-purpose data structure. Some important applications
and algorithms work on graphs, e.g., shortest-path or web page-rank.

5.1 Thought experiment 

To motivate our graph design, consider the "thought experiment" of designing a web search
engine. The search engine uses a directed graph representing the web structure. This graph
may be used, among other things, to compute page rank. Such an application processes
large amounts of data and performs many updates. For efficiency and scalability, processing
should be asynchronous; for responsiveness, processing should be incremental, as fast as
each page is crawled. Processing should not require any synchronisation, e.g., transactions.
A CRDT could be ideal.

payload set V, A           -- sets of pairs { (element e, unique-tag w), ... }
                           -- V: vertices; A: arcs
    initial ∅, ∅
query lookup (vertex v) : boolean b
    let b = (∃w : (v, w) ∈ V)
query lookup (arc (v', v'')) : boolean b
    let b = (lookup(v') ∧ lookup(v'') ∧ (∃w : ((v', v''), w) ∈ A))
update addVertex (vertex v)
    prepare (v) : w
        let w = unique()                       -- unique() returns a unique value
    effect (v, w)
        V := V ∪ {(v, w)}                      -- v + unique tag
update removeVertex (vertex v)
    prepare (v) : R
        pre lookup(v)                          -- precondition
        pre ∄v' : lookup((v, v'))              -- v is not the head of an existing arc
        let R = {(v, w) | ∃w : (v, w) ∈ V}     -- Collect all unique pairs in V containing v
    effect (R)
        V := V \ R
update addArc (vertex v', vertex v'')
    prepare (v', v'') : w
        pre lookup(v')                         -- head node must exist
        let w = unique()                       -- unique() returns a unique value
    effect (v', v'', w)
        A := A ∪ {((v', v''), w)}              -- (v', v'') + unique tag
update removeArc (vertex v', vertex v'')
    prepare (v', v'') : R
        pre lookup((v', v''))                  -- arc (v', v'') exists
        let R = {((v', v''), w) | ∃w : ((v', v''), w) ∈ A}
                                               -- Collect all unique pairs in A containing arc (v', v'')
    effect (R)
        A := A \ R

Figure 3: Directed Graph Specification (op-based)

We start with a Set CRDT containing some initial URLs to be crawled. A number of 
crawler processes run in parallel; each one removes some URL from the set and downloads 
it. (It might happen that the same page is downloaded twice but this does not impact 
correctness.) 

When a crawler finds a new page, it executes the corresponding addVertex. For every
page, it parses the links that the page contains, compares them with the page's previous
version, if any, and executes the corresponding addArc and removeArc invocations. Finally,
the URLs of the linked pages are added to the set to be crawled. Note that addArc must work
even if the page at the tail of the arc has not yet been found (it might not even exist), but such
an arc is not functional; a lookup of the corresponding arc should fail. Similarly, if a node
has been removed, all arcs incident to the node disappear. In this way, the behaviour of our
CRDT will be consistent with that of web pages, which are allowed to contain non-functional
URLs. Once the linked page is created, the link becomes relevant, e.g., for navigation and
for page-rank computation.
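
The link-processing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name link_updates and the encoding of operations as (name, head, tail) tuples are assumptions made for the example. It derives the addArc and removeArc invocations from the difference between the previous and the current link sets of a page.

```python
def link_updates(page, old_links, new_links):
    """Return the graph updates a crawler would issue for one page.

    Hypothetical helper: operations are encoded as (name, head, tail)
    tuples, where the head is the page that contains the link.
    """
    old_links, new_links = set(old_links), set(new_links)
    # Links that appeared since the previous version become addArc calls.
    ops = [("addArc", page, target) for target in sorted(new_links - old_links)]
    # Links that disappeared become removeArc calls.
    ops += [("removeArc", page, target) for target in sorted(old_links - new_links)]
    return ops


ops = link_updates("a.html",
                   old_links={"b.html", "c.html"},
                   new_links={"c.html", "d.html"})
print(ops)
```

Unchanged links (here c.html) generate no operation, so the number of updates is proportional to the size of the change rather than to the size of the page.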

In the web application, the graph is very large; sending the state between replicas and 
merging would be very costly. Therefore, we choose an op-based approach. 

5.2 Design alternatives for arc removal 

A directed graph is a pair of sets (V, A), called vertices and arcs respectively, such that
$A \subseteq V \times V$. Updates must maintain the invariant that the head and tail vertices of
an arc both exist. Therefore, adding an arc to A has the precondition that its two vertices
are in V; conversely, a vertex may be removed only if it supports no arc; these are
preconditions to prepare-update. Furthermore, the system must ensure that concurrent
operations $\mathit{addArc}(v', v'') \parallel \mathit{removeVertex}(v')$ do not violate the invariant.
Several alternatives may be considered: (i) Give precedence to removeVertex(v'): all arcs to
or from v' are removed as a side effect. This is easy to implement, by hiding any arc that
includes a removed vertex. (ii) Give precedence to addArc(v', v"): if either v' or v" has been
removed, it is restored. This requires recreating nodes that have been explicitly deleted.
(iii) removeVertex(v') is delayed until all concurrent addArc operations have executed. This
requires synchronisation, which violates the goals of asynchrony and fault tolerance. There is
no perfect choice. Hereafter, we choose Option (i) because it is adequate in our application
scenario.

5.3 Graph specification 

Figure 3 shows our specification for a Directed-Graph CRDT. In the next section, we prove 
that this object is indeed a CmRDT. 

This CRDT maintains two sets internally, one for the vertices and one for the arcs. To
add a vertex v, the prepare-update method creates a unique identifier, w, and the
effect-update method adds the pair (v, w) to the set of vertices. With this approach, each
vertex has a unique internal identifier. If the same vertex is added twice, the two additions
will be distinguished by their two unique identifiers. A lookup will mask the duplicates.
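
As a hedged illustration of the tagging scheme, the sketch below models only the vertex set. The class name VertexSet, the method names, and the use of uuid4 to implement unique() are assumptions; the specification only requires that unique() return a value never generated elsewhere.

```python
import uuid


class VertexSet:
    """Vertex part of the payload: a set of (vertex, unique-tag) pairs."""

    def __init__(self):
        self.V = set()

    def lookup(self, v):
        # A vertex is visible if at least one tagged copy exists.
        return any(element == v for (element, _tag) in self.V)

    def prepare_add_vertex(self, v):
        # Runs at the source replica only: pick a fresh tag (assumed uuid4).
        return (v, uuid.uuid4().hex)

    def effect_add_vertex(self, pair):
        # Runs at every replica, including the source.
        self.V.add(pair)


replica = VertexSet()
replica.effect_add_vertex(replica.prepare_add_vertex("a.html"))
replica.effect_add_vertex(replica.prepare_add_vertex("a.html"))
print(len(replica.V), replica.lookup("a.html"))
```

The payload holds two distinct pairs for the same vertex, yet lookup answers as if the vertex were present once.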

To remove vertex v, the prepare-update computes the set R of pairs that contain v, i.e.,
all copies known in the source replica; the effect-update method removes this same set R from
the set of vertices in all replicas. As operations are delivered in causal order, when the
effect-update method executes in some replica, the addVertex operation corresponding to
each pair in R has already executed there. Thus, unlike the state-based solution of
Section 4.2, a set need not keep tombstones.
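
The removal protocol can be sketched with two replicas, under simplifying assumptions: only the vertex set is modelled (so the arc-related precondition of removeVertex is omitted), causal delivery is simulated by applying effects in program order, and the function names are illustrative.

```python
import uuid


def lookup(V, v):
    return any(element == v for (element, _tag) in V)


def prepare_add_vertex(v):
    return (v, uuid.uuid4().hex)  # fresh tag, chosen at the source


def prepare_remove_vertex(V, v):
    # Precondition: the vertex must be observed at the source replica.
    if not lookup(V, v):
        raise ValueError("precondition failed: vertex not observed")
    # R: every tagged copy of v known at the source.
    return frozenset(pair for pair in V if pair[0] == v)


source, remote = set(), set()

# addVertex("x") is issued twice at the source; both effects reach both replicas.
for _ in range(2):
    pair = prepare_add_vertex("x")
    source.add(pair)
    remote.add(pair)

# removeVertex("x"): R is computed once, at the source, and the same R is
# removed everywhere. Causal delivery ensures the adds that created the
# pairs in R were applied at `remote` first.
R = prepare_remove_vertex(source, "x")
source.difference_update(R)
remote.difference_update(R)

print(len(R), source == remote == set())
```

After the removal, neither replica retains any trace of the vertex, which is the sense in which no tombstone is needed.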

If the same vertex is removed and added concurrently, the addVertex wins, as the new
unique identifier is not included in the set computed by the remove's prepare-update. This
approach is consistent with a sequential execution, as a vertex can be removed only if it is
observed. The same approach is used for arcs.
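
A small worked example of the add-wins behaviour, using fixed tags "t1" and "t2" in place of unique() for readability (an assumption of this sketch). Replica 1 prepares a removal of x while replica 2 concurrently prepares a new addition of x; the two effects are then applied in opposite orders.

```python
def lookup(V, v):
    return any(element == v for (element, _tag) in V)


initial = {("x", "t1")}

# Prepared at replica 1: remove every copy of x observed there.
R = {pair for pair in initial if pair[0] == "x"}
# Prepared concurrently at replica 2: a new copy of x with a fresh tag.
new_pair = ("x", "t2")

# Replica 1 applies its own remove first, then receives the add.
replica1 = set(initial)
replica1 -= R
replica1.add(new_pair)

# Replica 2 applies its own add first, then receives the remove.
replica2 = set(initial)
replica2.add(new_pair)
replica2 -= R

print(replica1 == replica2, lookup(replica1, "x"))
```

Both replicas end with the single pair ("x", "t2"): the removal only affects the copy it observed, so the concurrent addition survives.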

To remove a vertex, the source replica checks that the vertex is observed, and also that 
it is not the head of any existing arc. Conversely, to add an arc, its head node must exist, 
but there is no check for existence of the tail. The lookup method will mask the existence
of such an arc. However, if the tail is added later, then the arc becomes visible. Similarly, 
concurrent updates may remove a vertex that is the head of an arc. However, the lookup 
method will mask such an arc. 
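
The masking behaviour of the arc lookup can be illustrated as follows. The sketch uses fixed tags and plain Python sets for V and A; the function names are assumptions, and the logic follows the arc lookup query of Figure 3.

```python
def lookup_vertex(V, v):
    return any(element == v for (element, _tag) in V)


def lookup_arc(V, A, head, tail):
    # An arc is visible only if both endpoints are visible
    # and some tagged copy of the arc is in A.
    return (lookup_vertex(V, head)
            and lookup_vertex(V, tail)
            and any(arc == (head, tail) for (arc, _tag) in A))


V = {("p", "t1")}
A = {(("p", "q"), "t2")}  # addArc(p, q): head p exists, tail q not yet added

before_tail = lookup_arc(V, A, "p", "q")   # masked: tail is missing
V.add(("q", "t3"))
with_tail = lookup_arc(V, A, "p", "q")     # visible: both endpoints exist
V.discard(("p", "t1"))
without_head = lookup_arc(V, A, "p", "q")  # masked again: head was removed

print(before_tail, with_tail, without_head)
```

Note that the arc's pair stays in A throughout; only its visibility changes with the state of V.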

5.4 Proof that Directed Graph is a CRDT 

In this section, we prove that the specification of Figure 3 represents a CRDT. As effect- 
updates are always enabled, and as inspection of the code shows that every method execution 
terminates, termination follows. 

Lemma 5.1. addVertex(v') and addVertex(v") commute. 

Proof. addVertex(v') generates a unique identifier u'; addVertex(v") generates a unique
identifier u". For any initial state S = (V, A), in whatever order the two operations are
executed, the final state is the same:

$$S \bullet \mathit{addVertex}(v') \bullet \mathit{addVertex}(v'') = (V \cup \{(v', u')\} \cup \{(v'', u'')\}, A) = S \bullet \mathit{addVertex}(v'') \bullet \mathit{addVertex}(v') = (V \cup \{(v'', u'')\} \cup \{(v', u')\}, A).$$

□ 

Lemma 5.2. removeVertex(v') and removeVertex(v") commute.

Proof. removeVertex(v') computes a set, R', of pairs to be removed; removeVertex(v")
computes a set R". For any initial state S = (V, A), in whatever order the two operations are
executed, the final state is the same:

$$S \bullet \mathit{removeVertex}(v') \bullet \mathit{removeVertex}(v'') = ((V \setminus R') \setminus R'', A) = S \bullet \mathit{removeVertex}(v'') \bullet \mathit{removeVertex}(v') = ((V \setminus R'') \setminus R', A).$$

□ 

Lemma 5.3. Concurrent addVertex(v') and removeVertex(v") commute. 

Proof. addVertex(v') generates a unique identifier u'; removeVertex(v") computes a set R".
Since u' is a fresh unique identifier, $(v', u') \notin R''$. Thus, for any initial state
S = (V, A), in whatever order the two operations are executed, the final state is the same:

$$S \bullet \mathit{addVertex}(v') \bullet \mathit{removeVertex}(v'') = ((V \cup \{(v', u')\}) \setminus R'', A) = S \bullet \mathit{removeVertex}(v'') \bullet \mathit{addVertex}(v') = ((V \setminus R'') \cup \{(v', u')\}, A).$$

□ 

Proofs for arcs are similar, so we omit them. Finally, we need to prove that any operation
on vertices commutes with any operation on arcs. As operations on vertices and operations
on arcs modify disjoint internal sets, it is immediate that executing the two operations in
either order leads to the same state.

Theorem 5.1. Specification of Figure 3 represents a CmRDT. 

Proof. Effect-update methods are always enabled, and any pair of concurrent operations 
commute, per the lemmas above. 

□ 
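
The commutativity argument can be spot-checked mechanically. The sketch below is not a proof: it takes one initial state and four mutually concurrent, already-prepared operations (all tags and the tuple encoding are assumptions of the example), applies their effects in every possible order, and collects the distinct final states.

```python
from itertools import permutations


def apply_effect(state, op):
    """Apply one effect-update to an immutable state (V, A)."""
    V, A = state
    kind, arg = op
    if kind == "addVertex":
        return (V | {arg}, A)
    if kind == "removeVertex":
        return (V - arg, A)
    if kind == "addArc":
        return (V, A | {arg})
    if kind == "removeArc":
        return (V, A - arg)
    raise ValueError(kind)


initial = (frozenset({("w", "u0")}), frozenset({(("x", "y"), "u3")}))

# The removal sets contain only pairs observed in `initial`, never the
# freshly tagged pairs of the concurrent additions.
ops = [
    ("addVertex", ("v", "u1")),
    ("removeVertex", frozenset({("w", "u0")})),
    ("addArc", (("v", "w"), "u2")),
    ("removeArc", frozenset({(("x", "y"), "u3")})),
]

final_states = set()
for order in permutations(ops):
    state = initial
    for op in order:
        state = apply_effect(state, op)
    final_states.add(state)

print(len(final_states))
```

All 24 delivery orders yield a single final state, as the lemmas predict for concurrent operations.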
6 Comparison with previous work

Eventual consistency has been an active topic of research in highly-available, large-scale
asynchronous systems [17, 20]. Contrary to much previous work [3, for instance], we take a 
formal approach grounded in the theory of commutativity and semilattices. 

The state-based approach was invented for register-like objects, where the only update 
operation is assignment. It is in wide use in file systems such as NFS, AFS or Coda, and 
in key-value stores such as Dynamo [3] and Riak. Op-based approaches are used when the 
cost of transferring state is too high, e.g., databases, and when operation semantics are 
important, e.g., cooperative systems such as Bayou [13] or IceCube [15]. 

Although the CRDT concept was identified only recently, related designs have been 
published before. Johnson and Thomas invented the LWW-Register [9]. They propose
a database of registers that can be created, updated and deleted, using the last-writer- 
wins (LWW) rule to arbitrate between concurrent changes. LWW ensures a total order of 
operations, at the cost of losing concurrent updates. 
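
The trade-off of the LWW rule can be seen in a minimal state-based sketch. The class below is illustrative only and is not taken from Johnson and Thomas: in particular, the timestamp encoding as a (clock, replica id) pair, used to break ties deterministically, is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    value: object = None
    timestamp: tuple = (0, "")  # assumed encoding: (clock, replica id)

    def assign(self, value, clock, replica_id):
        return LWWRegister(value, (clock, replica_id))

    def merge(self, other):
        # Keep the state with the larger timestamp; timestamps are totally
        # ordered, so the result does not depend on the merge order.
        return self if self.timestamp >= other.timestamp else other


a = LWWRegister().assign("x", clock=1, replica_id="A")
b = LWWRegister().assign("y", clock=1, replica_id="B")

print(a.merge(b).value, b.merge(a).value)
```

Both merge orders agree on "y", so the replicas converge, but the concurrent assignment of "x" is silently discarded.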

Concurrent editing uses the related concept of Operational Transformation (OT), due 
to Ellis and Gibbs [7]. To ensure responsiveness, a local operation executes immediately. 
Operations are not designed to commute; however, a replica receiving an update transforms 
it against previously-executed concurrent updates to achieve a similar result. OT algorithms 
for a decentralised architecture have been proposed; Oster et al. show that most of them are 
incorrect [12]. We believe that designing for commutativity from the start is cleaner and 
simpler. 

The foundations of CvRDTs were introduced by Baquero and Moura [1]. We extend their
work with CmRDTs and with a number of new results. The CRDT concept was invented by
Shapiro and Preguiça in their work on Treedoc, a Sequence CRDT for concurrent editing
[14]. Logoot is another Sequence CRDT that supports an undo mechanism based on a 
CRDT Counter [21]. 

Roh et al. [16] independently developed the related concept of Replicated Abstract Data 
Type. They generalise LWW to a partial order of updates, which they leverage to build 
several LWW-style classes. 

Burckhardt and Leijen propose the Concurrent Revisions programming model for shared 
abstract data types, in which a forked revision runs in isolation until it joins again. Join is 
based on a three-way merge function [2]. They show that simple sequential merge functions
exist for ADTs built upon Abelian groups. We have also demonstrated the relation between 
CRDTs and sequential consistency in a similar, but more loosely-coupled, replication model. 

Ducourthial et al. study algebraic structures with specific properties in order to solve 
self-stabilisation problems [6]. They propose the so-called r-operator for "silent" tasks [4]. 
Strong convergence can be seen as a silent task, given a limited number of disturbing
updates. However, there are differences between the two approaches. Whereas a self-stabilising
system must tolerate arbitrary memory corruption, a shared mutable object should change
state durably only by executing update operations. Furthermore, whereas CvRDT states
constitute a monotonic semilattice, the r-operator requires a total order.
7 Conclusion

We presented the concept of a CRDT, a replicated data type for which some simple math- 
ematical properties guarantee eventual consistency. In the state-based style, the successive 
states of an object should form a monotonic semilattice, with merge computing a least up- 
per bound. In the op-based style, concurrent operations should commute. Assuming only 
that the communication subsystem ensures eventual delivery (in causal order for op-based 
objects), CRDTs are guaranteed to converge towards a common, correct state, without 
requiring any synchronisation. 
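
For the state-based style, the simplest instance of "merge computing a least upper bound" is a grow-only set ordered by inclusion, where merge is set union. The following sketch, with illustrative names and sample values, checks the algebraic properties that make such a merge safe to apply in any order and any number of times.

```python
def merge(s, t):
    """Least upper bound of two grow-only set states under inclusion."""
    return s | t


a = frozenset({"x"})
b = frozenset({"y"})
c = frozenset({"y", "z"})

checks = {
    "commutative": merge(a, b) == merge(b, a),
    "associative": merge(a, merge(b, c)) == merge(merge(a, b), c),
    "idempotent": merge(a, a) == a,
    "upper bound": a <= merge(a, b) and b <= merge(a, b),
}
print(checks)
```

These checks hold for the sample values only; for set union the properties hold in general, which is why replicas that exchange and merge states converge regardless of message order or duplication.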

We presented some simple CRDT examples, such as sets, and detailed how to create a 
directed Graph CRDT, which might be used in a large-scale web search engine. Our data 
types have a clean and deterministic semantics in the presence of concurrent updates. 

Eventual consistency is a critical technique in many large-scale distributed systems,
including delay-tolerant networks, sensor networks, peer-to-peer networks, collaborative
computing, cloud computing, and so on. However, work on eventual consistency has so far
been mostly ad-hoc. Although some of our CRDTs were known before in the literature or in the
folklore, this is the first work to engage in a systematic study. We believe this is required if
eventual consistency is to gain a solid theoretical and practical foundation.

Future work is both theoretical and practical. On the theory side, this will include un- 
derstanding the class of computations that can be accomplished by CRDTs, the complexity 
classes of CRDTs, the classes of invariants that can be supported by a CRDT, the relations 
between CRDTs and concepts such as self-stabilisation and aggregation, and so on. On the 
practical side, we plan to implement the data types specified herein as a library, to use them 
in practical applications, and to evaluate their performance analytically and experimentally. 
Another direction is to support infrequent, non-critical synchronous operations,
such as committing a state or performing a global reset. We will also look into stronger 
global invariants, possibly using probabilistic or heuristic techniques. 